cAdvisor was the tool that democratised container monitoring, and it remains relevant: the kubelet still embeds it. But in 2024, observing containers well requires more layers: cluster-state metrics, eBPF for deep visibility, and APM for application context. This article maps what to combine and how.
The Modern Minimum Stack
For serious Kubernetes in 2024:
- kubelet / cAdvisor: CPU, memory, network, disk metrics per container.
- kube-state-metrics: state of Deployments, Pods, ReplicaSets, HPA.
- node-exporter: node metrics.
- Prometheus scrapes everything, aggregates.
- Grafana visualises.
This covers roughly 80% of what you need to monitor, and it is a solid OSS stack.
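As a sketch of how the base stack wires together, a minimal Prometheus scrape job for the kubelet's embedded cAdvisor endpoint could look like this (the job name and service-account paths are the usual in-cluster defaults, shown here as illustration):

```yaml
scrape_configs:
  - job_name: kubernetes-cadvisor        # illustrative job name
    scheme: https
    kubernetes_sd_configs:
      - role: node                       # one target per node (the kubelet)
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __metrics_path__
        replacement: /metrics/cadvisor   # kubelet serves cAdvisor metrics here
```

kube-state-metrics and node-exporter are scraped the same way, via `role: endpoints` and `role: node` discovery respectively.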
What’s Missing Without eBPF
cAdvisor gives “surface” metrics:
- CPU usage %.
- Memory RSS.
- Network bytes.
- Disk I/O.
But not:
- Syscall latency: is the pod stuck on I/O?
- Network latencies between specific pods.
- CPU profiles: which functions consume the time.
- Function-level detail: hot paths.
For this, eBPF is the modern tool.
eBPF: The Game-Changing Layer
Modern eBPF tools:
Pixie
Pixie (CNCF sandbox, contributed by New Relic):
- Auto-instrumentation HTTP/gRPC/DNS without sidecar or code changes.
- Live flame graphs.
- Automatic service map.
- PxL-language queries.
One per-node eBPF agent + web UI. Developer-friendly.
Grafana Beyla
- Auto-instrumentation for Go, Java, Node apps.
- Generates OpenTelemetry traces without code modification.
- Grafana stack integration.
Simpler than Pixie, focused on traces/metrics.
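As an illustrative sketch of how lightweight the setup is (the port, collector address, and container wiring are assumptions about your deployment), Beyla is typically pointed at a target process via environment variables and ships OTLP to a collector:

```yaml
# Sketch: env block of a Beyla container (values are illustrative)
env:
  - name: BEYLA_OPEN_PORT               # instrument whatever process listens on this port
    value: "8080"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT   # assumed collector address in-cluster
    value: "http://otel-collector:4317"
```

No code change, no sidecar per pod: the eBPF probes attach from outside the application.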
Parca
- Continuous profiling of the whole cluster.
- eBPF flame graphs.
- Grafana-integrable.
Specific for CPU profiling.
Inspektor Gadget
Kinvolk/Microsoft tool for eBPF debugging:
- kubectl trace equivalents.
- Per-pod network snapshots.
- On-demand profiling.
APM: The Application Layer
eBPF gives infra visibility; APM gives application visibility:
- OpenTelemetry: open standard, increasingly adopted.
- Jaeger / Tempo: trace backends.
- Datadog / New Relic / Dynatrace: complete commercial.
With OTel SDK, your app instruments:
- Request spans.
- Business metrics.
- Correlated logs.
Beyla can auto-generate some of this, but for business metrics you need the SDK.
Combining Without Saturating
Common error: deploying every tool at once = massive overhead. A typical sweet spot:
- cAdvisor + kube-state-metrics + node-exporter: light base.
- eBPF (Pixie or Beyla): add when needing deep visibility.
- APM with OTel: for critical apps, not all.
- Commercial APM: only with clear use case vs OSS.
Each layer should add distinct value. Duplicating is waste.
Essential Per-Container Metrics
Always monitored:
- CPU throttling: is the pod rate-limited?
- Memory working set: real use, not RSS.
- OOM kills: key counter.
- Network errors: TX/RX drops.
- Disk pressure: fullness + I/O saturation.
- Restart count: flapping = problem.
For K8s additionally:
- Pod phase: Pending, Running, Failed.
- Readiness probe failures.
- HPA desired vs current.
- PVC usage.
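Most of these signals map directly onto PromQL over the cAdvisor and kube-state-metrics series. A few of them, sketched as Prometheus recording rules (rule names and label filters are illustrative assumptions about your setup):

```yaml
groups:
  - name: container-health
    rules:
      - record: container:memory_working_set_ratio   # real use vs limit, not RSS
        expr: |
          container_memory_working_set_bytes{container!=""}
            / (container_spec_memory_limit_bytes{container!=""} > 0)
      - record: pod:restarts:increase1h              # input for flapping detection
        expr: increase(kube_pod_container_status_restarts_total[1h])
      - record: pod:oom_killed                       # last termination was an OOM kill
        expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
```

The `> 0` filter on the memory limit avoids dividing by zero for containers with no limit set.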
Alerts Worth Having
Few but effective:
- Pod restart > N in Y minutes: flapping.
- Sustained CPU throttling > 50%: CPU limit too low.
- OOM kills: always investigate.
- Memory > 90% limit sustained: leak or sizing.
- Node not ready > X minutes: incident.
- HPA at max replicas for > Y min: capacity issue.
Fewer useful alerts > many ignored alerts.
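A few of the alerts above, sketched as Prometheus alerting rules; the thresholds and `for:` durations stand in for the N/X/Y placeholders and should be tuned to your cluster:

```yaml
groups:
  - name: container-alerts
    rules:
      - alert: CPUThrottlingSustained
        expr: |
          rate(container_cpu_cfs_throttled_periods_total[5m])
            / rate(container_cpu_cfs_periods_total[5m]) > 0.5
        for: 15m          # "sustained": illustrative duration
      - alert: HPAAtMaxReplicas
        expr: |
          kube_horizontalpodautoscaler_status_current_replicas
            >= kube_horizontalpodautoscaler_spec_max_replicas
        for: 30m          # capacity issue if pinned at max this long
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 10m
```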
Dashboards: What to Show
Typical levels:
- Cluster overview: total resources, nodes, pods.
- Namespace: per team/application.
- Workload: specific deploy, pods, containers.
- Pod detail: drill-down for troubleshooting.
The community Kubernetes dashboards on grafana.com (IDs 315, 6417, etc.) are good starting points.
Logging Integration
Metrics without logs is half the picture. Typical stack:
- Fluent Bit or Loki-native shippers for log collection.
- Loki for storage + Grafana for visualization.
- Trace correlation via trace IDs.
When investigating an incident, you need metrics + logs + traces on the same timeline.
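One way to get that correlation is Grafana's derived fields on the Loki datasource, which turn a trace ID found in a log line into a link to the trace backend. A provisioning sketch (the regex and the `tempo` UID are assumptions about your log format and datasource setup):

```yaml
# Sketch: Grafana datasource provisioning (regex and UID are illustrative)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'  # assumes logs embed "trace_id=<id>"
          url: '$${__value.raw}'          # $$ escapes $ in provisioning files
          datasourceUid: tempo            # assumed Tempo datasource UID
```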
Security Observability
Complementary:
- Falco: eBPF runtime security.
- Tracee (Aqua): similar, eBPF-based.
- Kubernetes API audit logs.
Not standard “monitoring” but part of the complete picture.
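To give a flavour of runtime security rules, here is a simplified sketch in Falco's rule syntax; it is a reduced version of the stock "shell in container" idea and assumes the `container` macro from Falco's default ruleset:

```yaml
- rule: Shell in container (sketch)
  desc: Detect an interactive shell spawned inside a container
  condition: evt.type = execve and container and proc.name in (bash, sh)
  output: "Shell in container (user=%user.name container=%container.name cmd=%proc.cmdline)"
  priority: WARNING
```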
Tools No Longer Recommended
- Telegraf: still valid, but the Prometheus ecosystem is the default now.
- Standalone InfluxDB: Prometheus + Loki cover the same ground.
- Legacy Stackdriver agents: GCP-only lock-in.
- ELK for metrics: Elasticsearch is better kept for logs alone.
A Practical Example
Typical 50-node, 500-pod cluster:
- Prometheus federation: ~2000 targets, 5M series.
- Retention: 30 days hot + object storage.
- Grafana with 10-15 curated dashboards.
- 20-30 useful alerts.
- Beyla in some namespaces for traces.
- Loki for logs.
An OSS stack, with roughly 5% of total cluster CPU/RAM as overhead.
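For the 30-day hot retention plus object storage offload, one common shape is the Prometheus Operator CRD with a Thanos sidecar; a sketch, where the Secret name and key are assumptions about your bucket configuration:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
spec:
  retention: 30d                    # hot, local TSDB retention
  thanos:
    objectStorageConfig:            # sidecar uploads TSDB blocks to object storage
      name: thanos-objstore-secret  # assumed Secret name
      key: objstore.yml             # assumed key holding the bucket config
```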
Conclusion
Monitoring containers well in 2024 requires more than cAdvisor. The OSS base (Prometheus + kube-state-metrics + node-exporter + Grafana) is solid and sufficient for most. eBPF (Pixie, Beyla, Parca) adds deep visibility when needed. APM with OpenTelemetry complements with application vision. The trap is over-engineering: more tools = more maintenance and noise. Start with solid base, add layers when use case justifies, and maintain alert discipline — fewer well-thought ones win.
Follow us on jacar.es for more on observability, Kubernetes, and eBPF.